Support JSON reader#64830

Merged
Gabriel39 merged 7 commits into apache:refact_reader_branch from Gabriel39:dev_0625 on Jun 26, 2026

@Gabriel39

Contributor

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: File scanner v2 did not have a native JSON file reader. This change adds JSON as a supported v2 file format, wires it through table reader creation and hive reader schema mapping, and implements JSON parsing/materialization directly in the v2 reader without delegating to the legacy NewJsonReader path. Unit tests cover line JSON, outer array documents, json_root, jsonpaths, requested column ordering, nullable missing fields, required missing fields, strict malformed JSON errors, and ignore-malformed null rows.
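The interaction between json_root, jsonpaths, requested column ordering, and nullable missing fields can be sketched as follows. This is a minimal illustration, not the reader's implementation: `Node`, `leaf`, `resolve`, and `extract_row` are hypothetical names, the parsed document is modeled with plain standard-library containers instead of simdjson values, and only object-key paths such as `$.a.b` are assumed.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed JSON value (the real reader works on simdjson values).
struct Node {
    std::string scalar;             // value text when the node is a leaf
    std::vector<std::string> keys;  // member names when the node is an object
    std::vector<Node> children;     // member values, parallel to `keys`

    // Appends one object member and returns *this so calls can be chained.
    Node& with(const std::string& key, Node child) {
        keys.push_back(key);
        children.push_back(std::move(child));
        return *this;
    }

    // Returns the member named `key`, or nullptr when it is absent.
    const Node* find(const std::string& key) const {
        for (std::size_t i = 0; i < keys.size(); ++i) {
            if (keys[i] == key) return &children[i];
        }
        return nullptr;
    }
};

// Builds a leaf node holding `value`.
Node leaf(std::string value) {
    Node n;
    n.scalar = std::move(value);
    return n;
}

// Resolves a simplified path of the form "$", "$.a" or "$.a.b" (object keys only).
// Returns nullptr when the path is malformed or any key is missing.
const Node* resolve(const Node& root, const std::string& path) {
    if (path.empty() || path[0] != '$') return nullptr;
    const Node* cur = &root;
    std::size_t pos = 1;
    while (pos < path.size()) {
        if (path[pos] != '.') return nullptr;
        const std::size_t next = path.find('.', pos + 1);
        const std::string key = (next == std::string::npos)
                                        ? path.substr(pos + 1)
                                        : path.substr(pos + 1, next - pos - 1);
        cur = cur->find(key);
        if (cur == nullptr) return nullptr;
        pos = (next == std::string::npos) ? path.size() : next;
    }
    return cur;
}

// Produces one cell per jsonpaths entry, in jsonpaths order.
// std::nullopt marks a field that is missing under json_root (a NULL for a nullable column).
std::vector<std::optional<std::string>> extract_row(const Node& doc, const std::string& json_root,
                                                    const std::vector<std::string>& jsonpaths) {
    std::vector<std::optional<std::string>> row(jsonpaths.size());
    const Node* base = resolve(doc, json_root);
    if (base == nullptr) return row;
    for (std::size_t i = 0; i < jsonpaths.size(); ++i) {
        const Node* value = resolve(*base, jsonpaths[i]);
        if (value != nullptr) row[i] = value->scalar;
    }
    return row;
}
```

In this sketch the output order follows the jsonpaths list rather than the key order in the document, and a path that does not resolve yields an empty cell instead of an error. How a required (non-nullable) column reacts to a missing field is left out here.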

### Release note

Support JSON reader in file scanner v2.

### Check List (For Author)

- Test: Unit Test / Manual test
    - Added JsonReaderTest coverage for different JSON input scenarios.
    - Ran git diff --check.
    - Ran build-support/check-format.sh.
    - Attempted ./run-be-ut.sh --run --filter='JsonReaderTest.*', but sandbox execution failed because nproc is unavailable, .git/modules submodule config writes are denied, and GitHub dependency download DNS is blocked. Retried with escalated permissions twice, but approval review timed out before execution.
- Behavior changed: Yes. File scanner v2 can create a native JSON reader for FORMAT_JSON.
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add comments to the file scanner v2 JSON reader interfaces and non-obvious implementation paths, including synthetic schema handling, requested column mapping, simdjson buffer lifetime, json_root/jsonpaths behavior, duplicate key handling, and malformed-row rollback.

### Release note

None

### Check List (For Author)

- Test: No need to test (comment-only change)
    - Ran git diff --check.
    - Ran build-support/check-format.sh.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The JsonReaderTest helper always set a valid null indicator for constructed slot descriptors, so SlotDescriptor treated even non-nullable DataTypeString slots as nullable. This made the missing-required-column test expect an error while the test input actually described a nullable slot. Fix the helper to set nullIndicatorBit to -1 for non-nullable types.
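The helper fix described above can be illustrated with a simplified model. `SlotSpec` and `make_slot` are hypothetical names, not the Doris `SlotDescriptor` API, and the sketch assumes only what the summary states: a null indicator bit of -1 marks a slot as non-nullable.

```cpp
#include <string>

// Hypothetical, simplified stand-in for the slot descriptor fields involved.
struct SlotSpec {
    std::string name;
    int null_indicator_bit;  // -1 means "no null indicator", i.e. the slot is not nullable

    bool is_nullable() const { return null_indicator_bit != -1; }
};

// Test-helper sketch: derive the null indicator from the declared type's nullability
// instead of always assigning a valid bit.
SlotSpec make_slot(const std::string& name, bool type_is_nullable) {
    return SlotSpec{name, type_is_nullable ? 0 : -1};
}
```

With a helper shaped like this, a slot built for a non-nullable string type reports `is_nullable() == false`, so a missing-required-column test exercises a slot that is actually required.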

### Release note

None

### Check List (For Author)

- Test: Unit Test / Manual test
    - Ran git diff --check.
    - Ran build-support/check-format.sh.
    - Attempted ./run-be-ut.sh --run --filter='JsonReaderTest.ReturnsErrorForMissingRequiredColumn', but sandbox execution failed because nproc is unavailable, .git/modules submodule config writes are denied, and GitHub dependency download DNS is blocked. Retried with escalated permissions twice, but approval review timed out before execution.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: File scanner v2 JSON reader had three regressions. First, openx_json_ignore_malformed appended an all-null row for malformed records, while Hive/OpenX semantics skip malformed records. Second, empty JSON lines were still passed to simdjson and failed with EMPTY. Third, the reader assumed nullable output columns always had nullable source serdes and called get_nested_serdes on scalar serdes, which broke CDC TVF JSON rows whose output file column is nullable but source slot serde is not. This change skips malformed rows, treats empty JSON lines as empty rows, and only unwraps serdes when the source type is actually nullable.
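The third fix, unwrapping a serde only when the source type is nullable, might look roughly like the sketch below. `Serde`, `IntSerde`, `NullableSerde`, and `value_serde` are hypothetical names for illustration; the real Doris serde classes and `get_nested_serdes` have different signatures.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical serde hierarchy: only wrapper serdes expose a nested serde.
struct Serde {
    virtual ~Serde() = default;
    virtual std::string name() const = 0;
    // Scalar serdes have nothing to unwrap in this sketch.
    virtual const Serde* nested() const { return nullptr; }
};

struct IntSerde : Serde {
    std::string name() const override { return "Int"; }
};

struct NullableSerde : Serde {
    explicit NullableSerde(std::unique_ptr<Serde> inner) : inner_(std::move(inner)) {}
    std::string name() const override { return "Nullable(" + inner_->name() + ")"; }
    const Serde* nested() const override { return inner_.get(); }

private:
    std::unique_ptr<Serde> inner_;
};

// Picks the serde used to parse a value. The decision depends on the SOURCE type's
// nullability; the nullability of the output column is deliberately not consulted.
const Serde& value_serde(const Serde& source, bool source_type_is_nullable) {
    if (source_type_is_nullable) {
        if (const Serde* inner = source.nested()) return *inner;
    }
    return source;
}
```

The point of the guard is that a nullable output column fed by a non-nullable source slot keeps using the scalar serde as-is, which is the CDC TVF case the summary mentions.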

### Release note

None

### Check List (For Author)

- Test: Unit Test / Manual test
    - Added JsonReaderTest coverage for present required columns, ignored malformed rows, and empty JSON lines.
    - Ran git diff --check.
    - Ran build-support/check-format.sh.
    - Attempted ./run-be-ut.sh --run --filter='JsonReaderTest.*', but sandbox execution failed because nproc is unavailable, .git/modules submodule config writes are denied, and GitHub dependency download DNS is blocked. Retried with escalated permissions twice, but approval review timed out before execution.
- Behavior changed: No
- Does this need documentation: No

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally with the specific error message) and how it was fixed.
  2. Which behaviors were modified: what the previous behavior was, what it is now, why it was changed, and what the possible impacts are.
  3. What features were added and why.
  4. Which code was refactored and why.
  5. Which functions were optimized and how they differ before and after the optimization.

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The file scanner v2 JSON reader exposed every synthetic file column as nullable. For CDC TVF sources with non-nullable columns, this made the table mapper see Nullable(INT) where the source slot was INT and produced a mapping projection cast failure. This change preserves the source slot nullability in the JSON file schema while keeping the existing runtime handling for nullable output columns.
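A minimal sketch of the schema change, assuming the file schema is derived column by column from the source slots. `SourceSlot`, `FileColumn`, `build_file_schema`, and `display_type` are hypothetical names used only for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical description of a source slot and of the synthetic file column built from it.
struct SourceSlot {
    std::string name;
    std::string type;
    bool nullable;
};

struct FileColumn {
    std::string name;
    std::string type;
    bool nullable;
};

// Builds the synthetic file schema. Each column copies the source slot's nullability
// instead of being forced to nullable.
std::vector<FileColumn> build_file_schema(const std::vector<SourceSlot>& slots) {
    std::vector<FileColumn> schema;
    schema.reserve(slots.size());
    for (const SourceSlot& slot : slots) {
        schema.push_back(FileColumn{slot.name, slot.type, slot.nullable});
    }
    return schema;
}

// Renders the type the table mapper would see for a file column.
std::string display_type(const FileColumn& column) {
    return column.nullable ? "Nullable(" + column.type + ")" : column.type;
}
```

Under this model a non-nullable `INT` source slot yields a file column of type `INT` rather than `Nullable(INT)`, so the mapper has no mismatched types to cast between.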

### Release note

None

### Check List (For Author)

- Test: Unit Test / Manual test
    - Updated JsonReaderTest to assert nullable and non-nullable file schema columns.
    - Ran git diff --check.
    - Ran build-support/check-format.sh.
    - Attempted ./run-be-ut.sh --run --filter='JsonReaderTest.*', but sandbox execution failed because nproc is unavailable, .git/modules submodule config writes are denied, and GitHub dependency download DNS is blocked. Retried with escalated permissions twice, but approval review timed out before execution.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The file scanner v2 JSON reader now skips malformed rows when openx_json_ignore_malformed is enabled. The Hive OpenX JSON regression test still expected the old behavior, which materialized each malformed row as all NULL values, so its expected output no longer matched the corrected result. This change updates the expected q1 output to contain only the valid JSON rows.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check.
    - Did not run the external Hive regression locally because the external Hive test environment is not available in this workspace.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: File scanner v2 JSON reader incorrectly skipped malformed JSON documents when openx_json_ignore_malformed was enabled. The existing OpenX JSON reader semantics materialize one all-NULL row for each ignored malformed document when all projected columns are nullable. This change restores that compatibility by rolling back any partial writes and appending a NULL row for malformed documents, and updates the JsonReader unit test and Hive regression expected output accordingly.
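The rollback-then-NULL-row behavior described above can be sketched like this. It is an illustration under stated assumptions, not the reader's code: `Column`, `Doc`, and `append_doc` are hypothetical, a parse failure is simulated by a flag that fires after some fields were already written, every column is assumed nullable, and each document is assumed to mention a column at most once.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// One nullable string column; std::nullopt represents NULL.
using Column = std::vector<std::optional<std::string>>;

// Hypothetical document model: fields in encounter order, plus a flag simulating
// a parse error that surfaces only after those fields were written.
struct Doc {
    std::vector<std::pair<std::size_t, std::string>> fields;  // (column index, value)
    bool malformed = false;
};

// Appends one document as one row. Returns false when the error must be propagated
// (strict mode); in that case the columns are left exactly as they were.
bool append_doc(std::vector<Column>& columns, const Doc& doc, bool ignore_malformed) {
    const std::size_t rows_before = columns.empty() ? 0 : columns[0].size();

    // Write fields as they are encountered, before knowing whether the document is valid.
    for (const auto& [col, value] : doc.fields) {
        columns[col].push_back(value);
    }

    if (!doc.malformed) {
        // Pad columns the document did not mention so all columns stay the same length.
        for (Column& c : columns) {
            if (c.size() == rows_before) c.push_back(std::nullopt);
        }
        return true;
    }

    // Roll back partial writes so no column keeps a value from the bad document.
    for (Column& c : columns) c.resize(rows_before);
    if (!ignore_malformed) return false;

    // Ignore-malformed mode: materialize one all-NULL row for the document.
    for (Column& c : columns) c.push_back(std::nullopt);
    return true;
}
```

Recording the row count before writing is what makes the rollback cheap: every column is simply truncated back to that length, and only then is the NULL row appended.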

### Release note

None

### Check List (For Author)

- Test: Unit Test / Manual test
    - Ran git diff --check.
    - Formatted changed BE C++ files with Homebrew clang-format 16.0.6.
    - Attempted ./run-be-ut.sh --run --filter='JsonReaderTest.*' with JDK17; sandbox execution failed because nproc is unavailable, submodule config writes are denied, and GitHub dependency download DNS is blocked. Escalated retries timed out before execution.
- Behavior changed: No
- Does this need documentation: No

Gabriel39 merged commit 267891d into apache:refact_reader_branch on Jun 26, 2026
31 of 36 checks passed
Gabriel39 added a commit that referenced this pull request Jun 26, 2026

2 participants